[Kernel][AMD] Avoid D2H copy and cumsum kernel #22683
Conversation
Current AITER attention for prefill incurs one D2H copy (which forces a synchronization between the CPU and the GPU) and one cumsum kernel. The D2H copy is used to get the actual number of KV tokens on the host side for intermediate buffer allocation; this value does not change during inference, so we can pre-compute it when the AttentionMetadata is constructed. Similarly, the cumsum can be computed at AttentionMetadata construction time.
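For context, here is a minimal sketch of the idea, not the actual vLLM code: it assumes the per-request KV lengths are already available on the host as a CPU tensor when the attention metadata is built. The field names cu_seq_lens and num_actual_kv_tokens follow the attributes discussed in the review below; the class and function structure are purely illustrative.

```python
import torch
from dataclasses import dataclass


@dataclass
class AiterFlashAttentionMetadata:
    # Both fields are computed once at metadata-build time, so the forward
    # pass needs neither a per-step cumsum kernel nor a blocking D2H copy.
    cu_seq_lens: torch.Tensor   # int32 [num_reqs + 1], prefix sums of KV lengths
    num_actual_kv_tokens: int   # host-side scalar, used to size intermediate buffers


def build_metadata(seq_lens_cpu: torch.Tensor,
                   device: torch.device) -> AiterFlashAttentionMetadata:
    # seq_lens_cpu lives on the host, so summing it here costs no GPU sync,
    # unlike calling .item() on a device tensor inside forward().
    num_actual_kv_tokens = int(seq_lens_cpu.sum())
    cu_seq_lens = torch.zeros(seq_lens_cpu.numel() + 1, dtype=torch.int32)
    cu_seq_lens[1:] = torch.cumsum(seq_lens_cpu, dim=0)
    return AiterFlashAttentionMetadata(
        cu_seq_lens=cu_seq_lens.to(device, non_blocking=True),
        num_actual_kv_tokens=num_actual_kv_tokens,
    )
```

As the description notes, the total number of KV tokens for a scheduled batch does not change during inference, so computing it once at metadata-construction time is safe, and the forward path no longer needs a blocking .item() call or a per-step cumsum kernel.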
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request aims to optimize the AITER attention backend for prefill on ROCm by avoiding a device-to-host copy and a redundant cumsum kernel execution. The changes achieve this by pre-computing cu_seq_lens and num_actual_kv_tokens during the AttentionMetadata construction, which is a sound optimization strategy.
My review includes a suggestion to correct a type hint for the newly added cu_seq_lens attribute.
Additionally, I've identified a potential critical issue in the AiterFlashAttentionImpl.forward method. It appears that for prefill requests, both the prefill attention kernel (flash_attn_varlen_func) and the decode attention kernel (paged_attention_v1) are called sequentially. This would cause the output of the prefill kernel to be overwritten by the decode kernel, likely resulting in incorrect outputs and performance degradation. I strongly recommend wrapping the decode kernel call in an else block to ensure it is only executed for decode requests. While this issue predates this PR, fixing it is critical for the backend to function correctly.
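As an illustration of the suggested fix (not the actual AiterFlashAttentionImpl.forward), here is a minimal sketch in which the kernel callables stand in for flash_attn_varlen_func and paged_attention_v1:

```python
import torch
from typing import Callable


def attention_forward(
    output: torch.Tensor,
    num_prefill_tokens: int,
    prefill_kernel: Callable[[torch.Tensor], None],  # stand-in for flash_attn_varlen_func
    decode_kernel: Callable[[torch.Tensor], None],   # stand-in for paged_attention_v1
) -> torch.Tensor:
    if num_prefill_tokens > 0:
        # Prefill path writes its attention results into `output`.
        prefill_kernel(output)
    else:
        # Decode path. Without this `else`, the decode kernel would also run on
        # prefill batches and overwrite the results written just above.
        decode_kernel(output)
    return output
```

This simplifies away mixed prefill/decode batches; the point is only that the decode kernel must not unconditionally follow the prefill kernel on the same output buffer.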
LGTM, nice find!
Some questions about the new metadata introduced.
@HAIAI @mgoin This is my first PR to vLLM OSS, so I'm not quite sure about the CI results. It looks to me like the TPU tests are not related? The v1-test failure seems to be on the NV platform, which my PR should not impact... Can you give some advice on the CI failures?
Hi @mxz297, unfortunately the CI has lots of failures at the moment; we usually try to keep it in a better state. Please ignore these for now and I will request a force merge.
The PR breaks full graph mode.
@fsx950223 could you file an issue for it please?
Purpose

Current AITER attention for prefill incurs one D2H copy (which forces a synchronization between the CPU and the GPU) and one cumsum kernel.

The D2H copy is used to get the actual number of KV tokens on the host side for intermediate buffer allocation; this value does not change during inference, so we can pre-compute it when the AttentionMetadata is constructed. Similarly, the cumsum can be computed at AttentionMetadata construction time.

Test Plan

Test Result

We can see the memcpy and the GPU bubbles are gone after the change. Saw about a 10% prefill latency improvement.

(Optional) Documentation Update
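The PR does not list the exact profiling commands. As an assumption-laden sketch of how one might verify the result with torch.profiler, where run_prefill_step is a hypothetical stand-in for one vLLM prefill iteration and event labels may differ on ROCm:

```python
import torch
from torch.profiler import profile, ProfilerActivity


def check_prefill_trace(run_prefill_step):
    """run_prefill_step: hypothetical callable that executes one prefill
    iteration on the AITER attention backend (placeholder, not a vLLM API)."""
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        run_prefill_step()
        torch.cuda.synchronize()
    events = prof.key_averages()
    # Event names vary by platform; on ROCm the D2H copy may appear under a
    # different label than "Memcpy DtoH".
    d2h = sum(e.count for e in events if "DtoH" in e.key)
    cumsum = sum(e.count for e in events if "cumsum" in e.key.lower())
    print(f"D2H copies: {d2h}, cumsum kernels: {cumsum}")
```

With the change in this PR, a prefill step would be expected to show neither the D2H copy nor the cumsum kernel in the trace.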